Skip to content

Cleanup project tags #5983

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Aug 7, 2019
Merged

Cleanup project tags #5983

merged 4 commits into from
Aug 7, 2019

Conversation

davidfischer
Copy link
Contributor

@davidfischer davidfischer commented Jul 23, 2019

  • Adds a management command to clean up tags. Cleanup involves lowercasing, slugifying (canonicalizing) and removing any tags with no project (no tagged items, specifically)
  • Future tags will come in canonicalized
  • Adds a management command to get tags from github for projects where we have no tags.
  • Adds an admin action to import tags from github that shares logic with above.

Notes & Possible Enhancements

  • GitHub actually limits the length of the tag which might not be a bad idea. Basically all tags longer than ~30 characters are certainly spam.

    Screen Shot 2019-07-23 at 11 01 10 AM

  • Limiting the number of tags for a project to some reasonable number like 20 isn't a bad idea.

  • I would definitely like to autocomplete tags but I think that's a task best left until after other design work

- Adds a management command to clean up tags
- Cleanup involves lowercasing and slugifying (canonicalizing)
- Future tags will come in canonicalized
@davidfischer davidfischer requested a review from a team July 23, 2019 23:16
@agjohnson
Copy link
Contributor

Diving into review now, but immediately:

  • Is there any reason not to make commands that will run a single time a data migration instead? If new tags are formatted correctly, we'll need to manually run this command once. If we make this a migration, it happens for all instances automatically. Would this be bad?
  • Is there any reason not to make recurring management commands an admin function instead?

Generally speaking, I think site admin is a better place for most management commands, especially those that don't require input. These commands are easier to find and keeps site admin role abilities encapsulated in the site admin, instead of requiring ops access to servers for some selective functions. These functions are generally lost.

If that is not the correct answer here, that is fine. But I like to steer us clear of management commands where possible. All of our management commands suffer from rot and usually reimplement logic that admin functions give us.

@davidfischer
Copy link
Contributor Author

Is there any reason not to make recurring management commands an admin function instead?

Cleaning tags will probably take on the order of 1-2 hours. Importing tags from github will take on the order of 1-3 days at least the first time. Incremental may be only 1 dayish. I'm not sure an admin function or a data migration is appropriate for that. We could implement it as an admin function that calls the same logic on a per project basis but I do want to run this for all projects.

If new tags are formatted correctly, we'll need to manually run this command once.

This is true for the cleaning aspect but not for deleting unused tags. Deleting unused tags should be fairly quick incrementally. When tags are first cleaned, there will be a lot to delete. For example, all tags with uppercase will be deleted. However, when a bunch of spam projects get deleted in the future, they will leave behind a bunch of spam tags which this management command will clean up on future runs. Possibly reformatting tags and removing unused ones should be separated.

@agjohnson
Copy link
Contributor

Roger, thanks for the background 👍

I think my reservations are mostly around frequency, but the length of run is another point.

Even a 1-2 hour migration is pretty annoying, so a management command is ok, though it seems redundant functionality.

For the tag import, what is the frequency we want to run this? Just once for now to test this data? A management command works as a first pass for bulk import, but I'd want to move this out of a management command eventually if we're running this frequently or this needs to be a scheduled task that ops needs to take. If we're working towards running this more frequently, there are plenty of options to automate this via celery.

I'm guessing this would be something we're running once for now, would it be helpful to run this selectively on projects in addition? This also sounds like later work.

@davidfischer
Copy link
Contributor Author

davidfischer commented Jul 25, 2019

I'm guessing this would be something we're running once for now, would it be helpful to run this selectively on projects in addition? This also sounds like later work.

For the import from github, I'm looking mostly just to run this once although once a year doesn't seem too bad. Having the ability to do it on a per project basis would be nice though. At the same time, connecting to the github API requires an API token which is separate from an app token (I think). Otherwise, the rate limit is ~60/hr.

Canonicalizing tags only ever needs to run once after this is merged as future tags will be canonicalized. Removing unused tags is something I would probably like to run every time we delete a bunch of spam projects. Once spam deletion is more automated, this could also be automated. Deleting unused tags should also be pretty fast overall (a few minutes maybe, I'm guesstimating).

To make this more concrete, how about this proposal:

  • Add an admin function to get tags for a single project. This will use our existing github credentials (client ID and client secret)
  • Leave the import from github as a management command. This takes a long time and is run infrequently. This will call the same admin function as above except for all projects and it will space them out so we don't hit rate limits.
  • Add an admin function to delete unused tags in the projects app.
  • Leave the canonicalizing tags as a management command. It only needs to be run once. At the end, it will call the same logic as the admin function to delete unused tags.

Edit: updated proposal since we can re-use our existing github credentials

@davidfischer
Copy link
Contributor Author

At the same time, connecting to the github API requires an API token which is separate from an app token (I think).

It looks like we can use our app credentials (client ID and client secret). It doesn't give any permissions but does increase the rate limit which is all we need for this.

@davidfischer
Copy link
Contributor Author

Add an admin function to delete unused tags in the projects app.

This is a bit awkward as you must select projects in order to do it and then those are ignored. Maybe this is best as a management command for now until spam deletion is automated and then this can be a celery task.

Screen Shot 2019-07-25 at 11 34 43 AM

- Make importing tags for a project an admin function
- Use the github app Oauth credentials
- Make removing unused tags a reusable function
  so it can be reused in the future by celery
@codecov
Copy link

codecov bot commented Jul 25, 2019

Codecov Report

Merging #5983 into master will increase coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5983      +/-   ##
==========================================
+ Coverage   79.14%   79.15%   +0.01%     
==========================================
  Files         175      176       +1     
  Lines       11127    11133       +6     
  Branches     1375     1377       +2     
==========================================
+ Hits         8806     8812       +6     
  Misses       1964     1964              
  Partials      357      357
Impacted Files Coverage Δ
readthedocs/core/tag_utils.py 100% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d1d4b8a...a3e47b5. Read the comment docs.

@agjohnson
Copy link
Contributor

This is a bit awkward as you must select projects in order to do it and then those are ignored

Yeah I agree, it sounds like management command works fine here. There's probably a pattern for admin action against all projects, short of selecting "all projects" in the UI selector. I think ultimately full automation would the way to go with some of these functoins, and leave fiddling with the admin out of it. I do see some potential to selectively choose Projects from the admin though.

@davidfischer
Copy link
Contributor Author

I do see some potential to selectively choose Projects from the admin though.

I added an admin function to get tags for a single project and used the same logic in the management command.

Regardless, I think this is ready for a full review.

Copy link
Member

@ericholscher ericholscher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. 👍 It would be useful to have some tests here eventually, but getting it shipped is more valuable to me.

help = __doc__

def handle(self, *args, **options):
queryset = Project.objects.filter(tags=None).filter(repo__contains='github.com')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this should live in a manager for github/gitlab/bitbucket and other providers we support. I've seen these types of queries scattered around a lot, and would be good to standardize on an approach. Not necessary in this PR though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend to agree and I wouldn't mind expanding them to the other providers.

self.stdout.write('Set tags for {}: {}'.format(project.slug, tags))

# Sleeping half a second should keep us under 5k requests/hour
time.sleep(0.5)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clever. 👍

@ericholscher ericholscher merged commit 28a8303 into master Aug 7, 2019
@ericholscher ericholscher deleted the davidfischer/tags-cleanup branch August 7, 2019 18:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants